Introduction to HPC

Scientific Programming with Python

Manuel Holtgrewe

Berlin Institute of Health at Charité

Session Overview

Aims

  • Learn about important applications for Python in scientific programming …
  • … in the biomedical/life sciences.

Considered Applications

  • “Data science wrangling” with polars
  • Fast numerical computations with SciPy/NumPy
  • Interfacing with R for Statistics
  • Machine learning with scikit-learn
  • Deep Learning with TensorFlow

Jupyter Notebooks

  • Introduction
  • Installation and Usage

Introduction

  • Jupyter Notebooks provide a web-based interactive computing environment
    • Jupyter Lab is the next generation
  • You edit cells with Python code, which is executed in a kernel
  • It’s best explained by example ;-)

Installation

  • Jupyter is installed as a Python package.

  • To install it locally with conda/mamba:

    mamba create -y -n jupyter-example jupyterlab
    conda activate jupyter-example
    jupyter lab # OR: jupyter notebook
    # OUTPUT:
    ...
    http://127.0.0.1:8888/?token=a5eea49bec7cd70538c21e947e2de596a3ab66b39506ee72
    ...
  • More kernels are available:

    • Bash
    • R
    • Julia

Using Jupyter (1)

  • There is no magic!
  • Packages can be installed in the conda environment (or via pip)
    • mamba install -y plotly_express
  • The kernel keeps the current state
    • to ensure a clean state, restart the kernel and rerun all cells
  • Bonus: ipython is a turbo-charged interpreter on the command line
  • Useful: Markdown cells for documentation

Using Jupyter (2)

Jupyter Lab vs. Notebook

  • Originally: ipython notebook
  • Jupyter Notebook: single file
  • Jupyter Lab: IDE-like, terminal, …

Remarks

  • .ipynb files are JSON files
  • They contain not only all code but also all output, including logs (stderr)
  • To clear all output: jupyter nbconvert --clear-output --to notebook --output=OUTPUT INPUT.ipynb
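
Since notebooks are plain JSON, you can inspect them with Python's json module (a hedged sketch; the file name is hypothetical):

import json

# open a notebook file and look at its JSON structure
with open("example.ipynb") as f:
    nb = json.load(f)

print(nb.keys())                    # typically: cells, metadata, nbformat, ...
print(nb["cells"][0]["cell_type"])  # e.g. "markdown" or "code"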

Using Jupyter (3)

Advantages

  • Really nice for trying things out interactively
  • Excellent for interactive data analysis / visualization

Drawbacks

  • Code in notebooks is hard to reuse (move reusable code into Python modules)
  • Version control with Git is feasible but not ideal (output is stored alongside code)

Using Jupyter on the HPC

  • Super easy to get started locally!
  • Also easy to launch a job on the HPC, but …
    • how to connect to its web interface?
  • Options: e.g., SSH port forwarding from your workstation to the compute node running Jupyter

Data Wrangling with Polars

  • Introduction
  • Loading and Writing Data
  • “Tidy Data” with Polars
  • Data Visualization

Introduction

What is Polars?

  • Polars is a DataFrame library for Python (written in Rust)
    • mamba install -y polars
  • It is (almost) a drop-in replacement for Pandas
  • Think “super-fast R-like data frames in Python”
    • also available for Rust, R, NodeJS
  • Python Documentation

Polars vs. Pandas

  • Pandas is the original library and de-facto standard
  • Nobody is particularly happy with Pandas (including its author)
    • In particular with Pandas indexing
  • Polars is a new library that tries to fix several issues
    • Polars has no indexing ;-)
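
A tiny sketch (illustrative, not from the slides) of the expression-based API that replaces index-based access:

import polars as pl

# no row index: chain filter/select expressions on the frame instead
df = pl.DataFrame({"name": ["alice", "bob"], "age": [23, 52]})
adults = df.filter(pl.col("age") > 30).select("name")
print(adults)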

Loading and Writing Data (1)

In [1]: import polars as pl
   ...: iris_data = pl.read_csv("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv")
   ...: iris_data.head()
Out[1]:
shape: (5, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ sepal.length ┆ sepal.width ┆ petal.length ┆ petal.width ┆ variety │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str     │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ Setosa  │
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ Setosa  │
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ Setosa  │
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ Setosa  │
│ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ Setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

Loading and Writing Data (2)

In [2]: iris_data.write_csv("/tmp/out.tsv", separator="\t")

In [3]: with open("/tmp/out.tsv") as inputf:
   ...:     print(inputf.readline())
   ...:     print(inputf.readline())
   ...:
sepal.length    sepal.width     petal.length    petal.width     variety

5.1     3.5     1.4     0.2     Setosa

Tidy Data with Polars (1)

  • Hadley Wickham wrote a whole paper about tidy data - he is probably the “most influential R guy”
  • Tidy data is a way to organize data in a tabular format
    • each variable is a column
    • each observation is a row
    • each type of observational unit is a table
  • This section is based on Kevin Heavey’s “Modern Polars”

Tidy Data with Polars (2)

Original data:

date,away_team,away_points,home_team,home_points
"Fri, Apr 1, 2016",Philadelphia 76ers,91,Charlotte Hornets,100
"Fri, Apr 1, 2016",Dallas Mavericks,98,Detroit Pistons,89
"Fri, Apr 1, 2016",Brooklyn Nets,91,New York Knicks,105
"Fri, Apr 1, 2016",Cleveland Cavaliers,110,Atlanta Hawks,108
"Fri, Apr 1, 2016",Toronto Raptors,99,Memphis Grizzlies,95

Tidy Data with Polars (3)

Load and clean the data.

import polars as pl
games_pl = (
    pl.read_csv("nba.csv")
    # drop rows in which all columns are null
    .filter(~pl.all(pl.all().is_null()))
    # parse date strings such as "Fri, Apr 1, 2016"
    .with_columns(
        pl.col("date").str.strptime(pl.Date, "%a, %b %d, %Y"),
    )
    .sort("date")
    # add a running row index to serve as game ID
    .with_row_count("game_id")
)
games_pl.head()

Output

┌─────────┬────────────┬─────────────────────┬─────────────┬───────────────────┬─────────────┐
│ game_id ┆ date       ┆ away_team           ┆ away_points ┆ home_team         ┆ home_points │
│ ---     ┆ ---        ┆ ---                 ┆ ---         ┆ ---               ┆ ---         │
│ u32     ┆ date       ┆ str                 ┆ i64         ┆ str               ┆ i64         │
╞═════════╪════════════╪═════════════════════╪═════════════╪═══════════════════╪═════════════╡
│ 0       ┆ 2016-04-01 ┆ Philadelphia 76ers  ┆ 91          ┆ Charlotte Hornets ┆ 100         │
│ 1       ┆ 2016-04-01 ┆ Dallas Mavericks    ┆ 98          ┆ Detroit Pistons   ┆ 89          │
│ 2       ┆ 2016-04-01 ┆ Brooklyn Nets       ┆ 91          ┆ New York Knicks   ┆ 105         │
│ 3       ┆ 2016-04-01 ┆ Cleveland Cavaliers ┆ 110         ┆ Atlanta Hawks     ┆ 108         │
│ 4       ┆ 2016-04-01 ┆ Toronto Raptors     ┆ 99          ┆ Memphis Grizzlies ┆ 95          │
└─────────┴────────────┴─────────────────────┴─────────────┴───────────────────┴─────────────┘

Data Operations

  • Polars (like Pandas and Hadley’s tidyverse for R) has a rich set of operations
    • Cheat Sheet
    • pivot - “long” to “wide” format
    • melt - “wide” to “long” format
    • join - join two tables
    • … much more!
  • Suggested exercise for later: try these operations on the NBA data (see the melt sketch below)
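
A hedged sketch of melt, continuing the NBA example from above: reshape from one row per game (“wide”) to one row per team and game (“long”):

# column names home_away/team are illustrative choices
games_long = games_pl.melt(
    id_vars=["game_id", "date"],
    value_vars=["away_team", "home_team"],
    variable_name="home_away",
    value_name="team",
)
games_long.head()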

Data Visualization (1)

  • Plotly Express is a high-level plotting library for Python
    • It is based on Plotly which is more low-level
    • This is really nice for creating interactive plots!
  • Suggested exercise for later: try it on the iris data (see the sketch below)
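
A minimal sketch, assuming plotly is installed and reusing the iris_data frame from the Polars section (to_pandas() requires pyarrow):

import plotly.express as px

fig = px.scatter(
    iris_data.to_pandas(),  # Plotly Express expects Pandas-like input
    x="sepal.length",
    y="sepal.width",
    color="variety",
)
fig.show()  # opens an interactive plot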

Data Visualization (2)

  • Vega is a declarative (JSON-based) grammar for data visualization
    • Vega-Lite is a higher-level grammar that compiles to Vega
    • Altair is a Python library that generates Vega-Lite specifications
    • Voyager is a web-based data exploration tool built on Vega-Lite
  • This is more advanced but also very powerful
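
A minimal Altair sketch (illustrative, reusing the iris_data frame from above; dots in column names have special meaning in Vega-Lite, hence the renaming):

import altair as alt

df = iris_data.to_pandas()
# rename columns: Vega-Lite treats "." as nested-field access
df.columns = [c.replace(".", "_") for c in df.columns]

chart = (
    alt.Chart(df)
    .mark_point()
    .encode(x="sepal_length", y="sepal_width", color="variety")
)
chart.save("iris.html")  # interactive HTML output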

Data Visualization (3)

  • Matplotlib is the “original” plotting library for Python
    • it offers pretty low-level plotting functions only
    • Seaborn adds higher level plotting functions
  • Here is the matplotlib tutorial
  • Historically, it can be seen as part of the “MATLAB functionality for Python” effort
    • 1995 Numeric (NumPy’s predecessor): fast numerics for Python
    • 2001 SciPy: scientific functions for Python, builds on NumPy
    • 2003 Matplotlib: plotting for Python, builds on NumPy
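
A minimal matplotlib sketch (an illustrative addition) showing the low-level, axes-based style:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()     # create figure and axes explicitly
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("sine.png")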

Conversion Between Polars and Pandas (1)

But - why?!

  • Polars is a new library
  • Pandas is the de-facto standard
  • Certain third-party libraries may not work with Polars yet
    • e.g., scikit-learn
    • e.g., rpy2

Installing dependencies

mamba create -y -n polars-example \
    polars pandas pyarrow

Conversion Between Polars and Pandas (2)

Polars to Pandas

In [1]: import polars as pl
   ...: import pandas as pd
   ...:
   ...: pl_data = pl.DataFrame({"name": ["alice", "bob"], "age": [23, 52]})
   ...: pd_data = pl_data.to_pandas()
   ...:

In [2]: type(pl_data)
Out[2]: polars.dataframe.frame.DataFrame

In [3]: type(pd_data)
Out[3]: pandas.core.frame.DataFrame

Conversion Between Polars and Pandas (3)

Pandas to Polars

In [1]: import polars as pl
   ...: import pandas as pd
   ...:
   ...: pd_data = pd.DataFrame({"name": ["alice", "bob"], "age": [23, 52]})
   ...: pl_data = pl.from_pandas(pd_data)

In [2]: type(pd_data)
Out[2]: pandas.core.frame.DataFrame

In [3]: type(pl_data)
Out[3]: polars.dataframe.frame.DataFrame

Numerical Computation with SciPy/NumPy

  • Introduction
  • Input and Output
  • Arrays and Shapes
  • Vectorized Operations

Introduction to SciPy/NumPy

NumPy

  • Numerical computing data structures
    • high-performance vectors/matrices
  • Numerical computing algorithms
    • vectorized operations, random numbers …
    • linear algebra

SciPy

  • Fundamental scientific algorithms
    • clustering
    • interpolation / smoothing
    • statistics
    • sparse matrices
    • image and signal processing
    • FFT, integration

👉 Mostly low-level building blocks; many other libraries build on top.

Input and Output (1)

Saving a single array to .npy format

import numpy as np

mat = np.array([[1, 2], [3, 4]])  # np.mat/np.matrix is discouraged; use plain arrays
np.save("mat.npy", mat)

Saving multiple arrays to .npz format (ZIP of .npy)

import numpy as np

arr = np.array([1, 2, 3, 4])
mat = np.array([[1, 2], [3, 4]])
np.savez("data.npz", arr=arr, mat=mat)

Input and Output (2)

Loading a single array from .npy format

In [1]: import numpy as np
   ...:
   ...: mat = np.load("mat.npy")
   ...: mat
Out[1]:
array([[1, 2],
       [3, 4]])

Loading multiple arrays from .npz:

In [1]: import numpy as np
   ...:
   ...: data = np.load("data.npz")
   ...:

In [2]: data.keys()
Out[2]: KeysView(NpzFile 'data.npz' with keys: arr, mat)

In [3]: data["arr"]
Out[3]: array([1, 2, 3, 4])

In [4]: data["mat"]
Out[4]:
array([[1, 2],
       [3, 4]])

Arrays and Shapes

  • arrays are the fundamental data structure
  • the shape attribute describes the dimensions
In [1]: import numpy as np

In [2]: arr = np.array([1, 2, 3, 4])

In [3]: arr
Out[3]: array([1, 2, 3, 4])

In [4]: arr.shape
Out[4]: (4,)

In [5]: arr.shape = (2, 2)

In [6]: arr
Out[6]:
array([[1, 2],
       [3, 4]])

Vectorized Operations (1)

  • vectorized operations are the key to fast numerical computations
  • they are implemented in C and Fortran
In [1]: import numpy as np
   ...:
   ...: a = np.array([1,2,3,4])
   ...: b = np.array([5,6,7,8])
   ...:

In [2]: np.add(a, b)
Out[2]: array([ 6,  8, 10, 12])

In [3]: np.multiply(a, b)
Out[3]: array([ 5, 12, 21, 32])
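
The same operations are also available as operators, and broadcasting extends them to mixed shapes (a hedged addition to the slide's example):

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
print(a + b)                    # [ 6  8 10 12]
print(a * 10)                   # scalar is broadcast: [10 20 30 40]
m = a.reshape(2, 2)
print(m * np.array([10, 100]))  # per-column scaling: [[10, 200], [30, 400]]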

Vectorized Operations (2)

Plain Python

In [1]: import timeit
   ...:
   ...: a = range(100_000_000)
   ...: b = range(100_000_000)
   ...:
   ...: timeit.timeit("[x + y for x, y in zip(a, b)]", globals=globals(), number=1)
   ...:
Out[1]: 8.240364263998345

With numpy

In [1]: import numpy as np
   ...: import timeit
   ...:
   ...: npa = np.arange(100_000_000)
   ...: npb = np.arange(100_000_000)
   ...:
   ...: timeit.timeit("np.add(npa, npb)", globals=globals(), number=1)
   ...:
Out[1]: 0.16427204399951734

SciPy Example

In [1]: import numpy as np
   ...: from scipy import linalg
   ...:
   ...: A = np.array([[1,3,5],[2,5,1],[2,3,8]])
   ...: A
Out[1]:
array([[1, 3, 5],
       [2, 5, 1],
       [2, 3, 8]])

In [2]: linalg.inv(A)
Out[2]:
array([[-1.48,  0.36,  0.88],
       [ 0.56,  0.08, -0.36],
       [ 0.16, -0.12,  0.04]])

In [3]: A.dot(linalg.inv(A))  # double-check
Out[3]:
array([[ 1.00000000e+00, -1.11022302e-16,  4.85722573e-17],
       [ 3.05311332e-16,  1.00000000e+00,  7.63278329e-17],
       [ 2.22044605e-16, -1.11022302e-16,  1.00000000e+00]])
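
The sanity check can be made explicit with np.allclose (a small addition, continuing the session above):

In [4]: np.allclose(A.dot(linalg.inv(A)), np.eye(3))
Out[4]: True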

Summary / Exercise

Summary

  • NumPy provides fast numerical data structures and operations
  • SciPy provides scientific functions / algorithms
  • By handing operations off to compiled C/Fortran code, programs run much faster than pure Python code

Potential Exercises

  • Implement your own vectorized function (How-To)
  • Try out the random number generator
  • Use the statistics functions such as average, quantile, etc.
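
A hedged starting point for the last two exercises, using NumPy's Generator API:

import numpy as np

# random number generation with a seeded Generator
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)

# simple summary statistics
print(np.average(samples))                      # close to 0.0
print(np.quantile(samples, [0.25, 0.5, 0.75]))  # close to [-0.674, 0.0, 0.674]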

Interfacing with R for Statistics

  • Introduction
  • Low-Level Approaches
  • Using rpy2

Introduction

Why interface to R with Python?

  • Python is an excellent general purpose language
    • great for scientific programming
    • increasing support for statistics (statsmodels, linearmodels, …)
  • R is the de facto standard for statistics
    • all the methods, all the features
    • excellent machine learning, biostatistics, …

rpy2 Installation

mamba create -y -n rpy2-example \
    python=3.10 rpy2 jupyterlab pandas \
    pyarrow polars
conda activate rpy2-example

Low-Level Approaches

Maybe the following is enough:

  1. write out data to disk, e.g., as CSV/TSV files
  2. call an R script from Python with subprocess.call
    • R script writes results to disk
  3. read in data from disk again

If so - do it!
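
A minimal sketch of this pattern (the data frame, the R script, and the file names are hypothetical):

import subprocess

import polars as pl

# 1. write the data to disk as TSV
iris_data.write_csv("input.tsv", separator="\t")
# 2. run an R script that reads input.tsv and writes results.tsv
subprocess.run(["Rscript", "analysis.R", "input.tsv", "results.tsv"], check=True)
# 3. read the results back into Python
results = pl.read_csv("results.tsv", separator="\t")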

If you need more advanced features, stay tuned!

Using rpy2 (1)

Hello World Pi!

$ R --vanilla

> pi
[1] 3.141593

In Python:

In [1]: from rpy2 import robjects
   ...: robjects.r["pi"]
Out[1]:
<rpy2.robjects.vectors.FloatVector object at 0x7f66d67a5800> [RTYPES.REALSXP]
R classes: ('numeric',)
[3.141593]
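
Arbitrary R code can also be evaluated from a string (a hedged sketch):

from rpy2 import robjects

result = robjects.r("sum(1:10)")  # evaluate R code, get an R vector back
print(result[0])                  # 55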

Using rpy2 (2)

Student’s t-test

In [1]: import rpy2.robjects as robjects
   ...: from rpy2.robjects.packages import importr
   ...:
   ...: stats = importr('stats')
   ...:
   ...: group1 = robjects.FloatVector([23.5, 25.1, 28.3, 29.8, 30.5])
   ...: group2 = robjects.FloatVector([20.1, 22.5, 25.3, 27.9, 29.2])
   ...:
   ...: result = stats.t_test(group1, group2)
   ...:
   ...: print(result)

        Welch Two Sample t-test

data:  c(23.5, 25.1, 28.3, 29.8, 30.5) and c(20.1, 22.5, 25.3, 27.9, 29.2)
t = 1.1311, df = 7.656, p-value = 0.2922
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.573733  7.453733
sample estimates:
mean of x mean of y
    27.44     25.00

In [2]: result
Out[2]:
<rpy2.robjects.vectors.ListVector object at 0x7f668c5796c0> [RTYPES.VECSXP]
R classes: ('htest',)
[FloatSexp..., FloatSexp..., FloatSexp..., FloatSexp..., ..., FloatSexp..., StrSexpVe..., StrSexpVe..., StrSexpVe...]
  statistic: <class 'rpy2.robjects.vectors.FloatVector'>
  <rpy2.robjects.vectors.FloatVector object at 0x7f668c6a3700> [RTYPES.REALSXP]
R classes: ('numeric',)
[1.131085]
  parameter: <class 'rpy2.robjects.vectors.FloatVector'>
  <rpy2.robjects.vectors.FloatVector object at 0x7f668c700b80> [RTYPES.REALSXP]
R classes: ('numeric',)
[7.656022]
  p.value: <class 'rpy2.robjects.vectors.FloatVector'>
  <rpy2.robjects.vectors.FloatVector object at 0x7f668c700f80> [RTYPES.REALSXP]
R classes: ('numeric',)
[0.292203]
  conf.int: <class 'rpy2.robjects.vectors.FloatVector'>
  <rpy2.robjects.vectors.FloatVector object at 0x7f668c57ad00> [RTYPES.REALSXP]
R classes: ('numeric',)
[-2.573733, 7.453733]
...
  null.value: <class 'rpy2.robjects.vectors.FloatVector'>
  <rpy2.robjects.vectors.FloatVector object at 0x7f668c579780> [RTYPES.REALSXP]
R classes: ('numeric',)
[2.157220]
  stderr: <class 'rpy2.robjects.vectors.StrVector'>
  <rpy2.robjects.vectors.StrVector object at 0x7f668c6c1100> [RTYPES.STRSXP]
R classes: ('character',)
['two.sided']
  alternative: <class 'rpy2.robjects.vectors.StrVector'>
  <rpy2.robjects.vectors.StrVector object at 0x7f668c549ac0> [RTYPES.STRSXP]
R classes: ('character',)
['Welch Two Sample t-test']
  method: <class 'rpy2.robjects.vectors.StrVector'>
  <rpy2.robjects.vectors.StrVector object at 0x7f668c5499c0> [RTYPES.STRSXP]
R classes: ('character',)
['c(23.5, 25.1, 28.3, 29.8, 30.5) and c(20.1, 22.5...]

Using rpy2 (3)

From Pandas to R

In [1]: import pandas as pd
   ...: import rpy2.robjects as ro
   ...: from rpy2.robjects.packages import importr
   ...: from rpy2.robjects import pandas2ri
   ...:
   ...: pd_df = pd.DataFrame({'int_values': [1,2,3],
   ...:                       'str_values': ['abc', 'def', 'ghi']})
   ...:
   ...: with (ro.default_converter + pandas2ri.converter).context():
   ...:   r_from_pd_df = ro.conversion.get_conversion().py2rpy(pd_df)
In [2]: base = importr('base')
   ...:
   ...: with (ro.default_converter + pandas2ri.converter).context():
   ...:   df_summary = base.summary(pd_df)
   ...: print(df_summary)
   int_values   str_values
 Min.   :1.0   Length:3
 1st Qu.:1.5   Class :character
 Median :2.0   Mode  :character
 Mean   :2.0
 3rd Qu.:2.5
 Max.   :3.0

Using rpy2 (4)

From R to Pandas

In [1]: import pandas as pd
   ...: import rpy2.robjects as ro
   ...: from rpy2.robjects.packages import importr
   ...: from rpy2.robjects import pandas2ri
   ...:
   ...: r_df = ro.DataFrame({'int_values': ro.IntVector([1,2,3]),
   ...:                      'str_values': ro.StrVector(['abc', 'def', 'ghi'])})

In [2]: base = importr('base')
   ...:
   ...: with (ro.default_converter + pandas2ri.converter).context():
   ...:   df_summary = base.summary(r_df)
   ...: print(df_summary)
   int_values   str_values
 Min.   :1.0   Length:3
 1st Qu.:1.5   Class :character
 Median :2.0   Mode  :character
 Mean   :2.0
 3rd Qu.:2.5
 Max.   :3.0

Using rpy2 (5)

Installing Packages

import rpy2.robjects.packages as rpackages
utils = rpackages.importr("utils")
utils.chooseCRANmirror(ind=1) # select the first mirror in the list

from rpy2.robjects.vectors import StrVector
packnames = ("ggplot2", "lattice", "lazyeval")
names_to_install = [x for x in packnames if not rpackages.isinstalled(x)]
if len(names_to_install) > 0:
    utils.install_packages(StrVector(names_to_install))

Using rpy2 (6)

Plotting with ggplot2

import numpy as np
import pandas as pd
import rpy2.robjects.packages as packages
import rpy2.robjects.lib.ggplot2 as ggplot2
import rpy2.robjects as ro

R = ro.r
datasets = packages.importr("datasets")
mtcars = packages.data(datasets).fetch("mtcars")["mtcars"]

gp = ggplot2.ggplot(mtcars)
pp = (gp 
      + ggplot2.aes_string(x="wt", y="mpg")
      + ggplot2.geom_point(ggplot2.aes_string(colour="qsec"))
      + ggplot2.scale_colour_gradient(low="yellow", high="red") 
      + ggplot2.geom_smooth(method="auto") 
      + ggplot2.labs(title="mtcars", x="wt", y="mpg"))

pp.plot()
R("dev.copy(png,'/tmp/out.png')")

Machine Learning with scikit-learn

  • Introduction
  • Estimator Cheat Sheet
  • Example: Clustering
  • Example: Regression
  • Example: Classification

Introduction

  • scikit-learn is a popular open-source machine learning library for Python
  • main features:
    • simple and efficient tools for data mining / analysis
    • built on top of libraries such as NumPy, SciPy, matplotlib
    • wide range of ML algorithms
    • ease of use, excellent documentation
  • we will cover only a small part

Estimator Cheat Sheet

  • scikit-learn’s documentation provides a flowchart (“Choosing the right estimator”) for picking an estimator based on the task and the amount of data

Example: Clustering

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data
data, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Create K-Means model
kmeans = KMeans(n_clusters=4)
# Fit the model to the data
kmeans.fit(data)
# Predict clusters for each data point
labels = kmeans.labels_
# Get cluster centers
cluster_centers = kmeans.cluster_centers_

plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='x', s=200)
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Example: Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate synthetic regression data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create Linear Regression model
model = LinearRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

# Visualize the data and regression line
plt.scatter(X_test, y_test, label='Test Data')
plt.plot(X_test, y_pred, color='red', label='Regression Line')
plt.title(f'Linear Regression (MSE: {mse:.2f})')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.show()

Example: Classification

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Generate synthetic classification data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create Logistic Regression model
model = LogisticRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Create a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Visualize the data and decision boundary
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='coolwarm', marker='o', label='True Class')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Create a mesh to plot decision boundary
x_min, x_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1
y_min, y_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.4, cmap='coolwarm')

plt.title(f'Logistic Regression (Accuracy: {accuracy:.2f})')
plt.legend()
plt.show()

Deep Learning with TensorFlow

  • Introduction
  • Installation
  • “TensorFlow 2 Quickstart for Beginners”
  • Running on the HPC

Introduction

  • TensorFlow: popular open-source library for machine learning, by Google
  • main features:
    • high-performance numerical computations
    • deep learning
    • ease of use, excellent documentation
  • applications: “you name it”, e.g.
    • image classification
    • bioinformatics

Warning

  • Deep learning is a huge topic
  • We will only scratch the surface
  • I’m not an expert 🙉🙈🙊
  • This is just to get you off the ground
    • in using GPUs for deep learning
    • on our HPC
  • … but chances are you will start out using preexisting tools anyway

Installation

Create a new conda environment (this may take a while):

mamba create -y -n python-tf tensorflow-gpu
conda activate python-tf

Verify that it works in principle:

python --version
# OUTPUT: Python 3.9.10
python -c 'import tensorflow; print(tensorflow.__version__)'
# OUTPUT: 2.6.2
python -c 'import tensorflow as tf; print(tf.config.list_physical_devices())'
# OUTPUT: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
# OUTPUT: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (hpc-cpu-88): /proc/driver/nvidia/version does not exist
# OUTPUT: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

On To The GPUs!

Deep learning is all fun and giggles until you realize that it is slow on CPUs.

hpc-login-1 $ srun --mem=10g --partition=gpu --gres=gpu:tesla:1 --pty bash -i
hpc-gpu-1 $ conda activate python-tf
hpc-gpu-1 $ python -c 'import tensorflow as tf; print(tf.config.list_physical_devices())'
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

“TensorFlow 2 Quickstart for Beginners”

$ python
>>> import tensorflow as tf
>>> mnist = tf.keras.datasets.mnist
>>> (x_train, y_train), (x_test, y_test) = mnist.load_data()
>>> x_train, x_test = x_train / 255.0, x_test / 255.0
>>> model = tf.keras.models.Sequential([
...   tf.keras.layers.Flatten(input_shape=(28, 28)),
...   tf.keras.layers.Dense(128, activation='relu'),
...   tf.keras.layers.Dropout(0.2),
...   tf.keras.layers.Dense(10)
... ])
>>> predictions = model(x_train[:1]).numpy()
>>> predictions
array([[-0.50569224,  0.26386747,  0.43226188,  0.61226094,  0.09630793,
         0.34400576,  0.9819117 , -0.3693726 ,  0.5221357 ,  0.3323232 ]],
      dtype=float32)
>>> tf.nn.softmax(predictions).numpy()
array([[0.04234391, 0.09141268, 0.10817807, 0.12951255, 0.07731011,
        0.09903987, 0.18743432, 0.04852816, 0.11835073, 0.09788957]],
      dtype=float32)
>>> loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
>>> loss_fn(y_train[:1], predictions).numpy()
2.3122327
>>> model.compile(optimizer='adam',
...               loss=loss_fn,
...               metrics=['accuracy'])
>>> model.fit(x_train, y_train, epochs=5)
2022-03-09 17:53:47.237997: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.2918 - accuracy: 0.9151
Epoch 2/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.1444 - accuracy: 0.9561
Epoch 3/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.1082 - accuracy: 0.9674
Epoch 4/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0898 - accuracy: 0.9720
Epoch 5/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0773 - accuracy: 0.9756
<keras.callbacks.History object at 0x154e81360190>
>>> model.evaluate(x_test,  y_test, verbose=2)
313/313 - 0s - loss: 0.0713 - accuracy: 0.9785
[0.0713074803352356, 0.9785000085830688]
>>> probability_model = tf.keras.Sequential([
...   model,
...   tf.keras.layers.Softmax()
... ])
>>> probability_model(x_test[:5])
<tf.Tensor: shape=(5, 10), dtype=float32, numpy=
array([[1.2339272e-06, 6.5599060e-10, 1.0560590e-06, 5.9356184e-06,
        5.3691075e-12, 1.4447859e-07, 5.4218874e-13, 9.9996936e-01,
        1.0347234e-07, 2.2147648e-05],
       [2.9887938e-06, 6.8461006e-05, 9.9991941e-01, 7.2003731e-06,
        2.9751782e-13, 8.2818183e-08, 1.4307782e-06, 2.3203837e-13,
        4.7433215e-07, 2.9504194e-14],
       [1.8058477e-06, 9.9928612e-01, 7.8716243e-05, 3.9140195e-06,
        3.0842333e-05, 9.4537208e-06, 2.2774333e-05, 4.5549971e-04,
        1.1015874e-04, 6.9138093e-07],
       [9.9978787e-01, 3.0206781e-08, 2.8528208e-05, 8.5581682e-08,
        1.3851340e-07, 2.3634559e-06, 1.8480707e-05, 1.0153375e-04,
        1.1583331e-07, 6.0887167e-05],
       [6.4914235e-07, 2.5808356e-08, 1.8225538e-06, 2.3215563e-09,
        9.9588013e-01, 4.6049720e-08, 3.8903639e-07, 2.9772724e-05,
        4.3141077e-07, 4.0867776e-03]], dtype=float32)>
>>> exit()

TensorFlow Slurm Jobs (1)

tf_script.py

#!/usr/bin/env python

import tensorflow as tf
print("TensorFlow version:", tf.__version__)
print(tf.config.list_physical_devices())

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0


model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

predictions = model(x_train[:1]).numpy()
print(predictions)

print(tf.nn.softmax(predictions).numpy())

# ... and so on ;-)

TensorFlow Slurm Jobs (2)

tf_job.sh

#!/usr/bin/bash

#SBATCH --job-name=tf-job
#SBATCH --mem=10g
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1

source $HOME/work/miniconda3/bin/activate
conda activate python-tf

python tf_script.py &>tf-out.txt

TensorFlow Slurm Jobs (3)

Submit by

sbatch tf_job.sh

Log output:

$ cat tf-out.txt 
2022-03-09 18:05:54.628846: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-09 18:05:56.999848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30988 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:18:00.0, compute capability: 7.0
TensorFlow version: 2.6.2
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[[-0.07757086  0.04676083  0.9420195  -0.59902835 -0.26286742 -0.392514
   0.3231195  -0.17169198  0.3480805   0.37013203]]
[[0.07963609 0.09017922 0.22075593 0.04727634 0.06616627 0.05812084
  0.11888511 0.07248258 0.12188996 0.12460768]]

Bring Your Own Project

🫵 Where can you apply what you have learned in your PhD project?

This is not the end…

… but all for this session

Recap

  • Overview of Python in scientific programming
  • Data wrangling with Polars
  • Numerical computations with SciPy/NumPy
  • Interfacing with R for Statistics
  • Machine Learning with scikit-learn
  • (A glimpse of) deep learning with TensorFlow